A First Computational Frame for Recognizing Heparin-Binding Protein

Heparin-binding protein (HBP) is a cationic antibacterial protein derived from multinuclear neutrophils and an important biomarker of infectious diseases. The correct identification of HBP is of great significance to the study of infectious diseases. This work provides the first HBP recognition framework based on machine learning to accurately identify HBP. By using four sequence descriptors, HBP and non-HBP samples were represented by discrete numbers. By inputting these features into a support vector machine (SVM) and random forest (RF) algorithm and comparing the prediction performances of these methods on training data and independent test data, it is found that the SVM-based classifier has the greatest potential to identify HBP. The model could produce an auROC of 0.981 ± 0.028 on training data using 10-fold cross-validation and an overall accuracy of 95.0% on independent test data. As the first model for HBP recognition, it will provide some help for infectious diseases and stimulate further research in related fields.


Introduction
Heparin-binding protein (HBP), also known as azurocidin or CAP-37, is a cationic antimicrobial protein derived from the granulosa protein of polynuclear neutrophils [1][2][3]. Studies have found that the biosynthetic HBP in neutrophils is rapidly released under bacterial stimulation, leading to increased vascular permeability and edema [4,5], and has a proinflammatory effect on a variety of leukocytes and epithelial cells [6]. Therefore, HBP in plasma can be used as a new diagnostic marker for bacterial skin infection, acute bacterial meningitis, leptospirosis, protozoan parasites, and even some noninfectious diseases [4,[7][8][9][10]. Especially for sepsis, a systemic inflammatory response syndrome caused by infection, HBP is an effective early and predictive biomarker [11][12][13]. In fact, it has been found that HBP levels in plasma are elevated in septic patients a few hours before the onset of hypotension or organ dysfunction [14].
The correct recognition of HBP can provide important clues for the study of biomarkers of infectious diseases. Traditional molecular biology methods can provide accurate information to study HBP [15,16]. However, these experiments require a longer cycle, more experimental resources, and more expensive manpower. The continuous accumulation of biological data provides a basis for us to mine potential biological knowledge from cycle, more experimental resources, and more expensive manpower. The continuous accumulation of biological data provides a basis for us to mine potential biological knowledge from these data [17][18][19][20][21][22]. The continuous progress of various data analysis methods and artificial intelligence technology provides a favorable tool for us to obtain knowledge [23]. In fact, machine learning methods have been widely used in the recognition of special functional proteins [20,[24][25][26][27][28][29][30][31][32][33], for example, bioluminescent protein [34], hormone-binding protein [35], and transcription factors [36][37][38][39][40]. In these works, several kinds of sequence descriptors, such as amino acid composition (AAC), reduced amino acid composition (RAAC) [41][42][43], pseudo amino acid composition (PseAAC) [20,44], and dipeptide composition (DC) [35], were developed.
Although research on these special functional proteins has been successful, to our knowledge, there is still no computational prediction work for HBP recognition at present because there was a lack of available datasets in the past, and people had previously paid more attention to the research of molecular biology experiments. Thus, it is urgent to develop an efficient prediction model to identify HBP.
This work aims to build a powerful computational model to identify HBP. At first, a reliable benchmark dataset was collected and constructed for training and testing various computational models. Subsequently, four sequence descriptors were adopted to formulate sequence samples. Two kinds of machine learning methods, namely, support vector machine (SVM) [45] and random forest (RF) [46], were selected as classifiers for executing classification. The following sections provide a detailed description of the workflow (Figure 1).

Benchmark Dataset Construction
In biological macromolecular classification and recognition, a reliable benchmark dataset is the foundation for constructing a reliable model [47][48][49][50]. It is well known that the Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data [51]. This database provides abundant protein information. Therefore, the raw HBP data were collected from UniProt by using heparin binding (KW-0358) as keyword. In UniProt, there are five forms of evidence for protein existence, that is, evidence at the protein level, evidence at the transcriptional level, evidence from homology, predicted, and uncertain. Obviously, the proteins with evidence provided by the first two have higher reliability, so HBPs with the other three kinds of evidence were excluded. In addition, sequences that have ambiguous residues, such as "B", "J", "O", "U", "X", and "Z" were checked and excluded. Finally, a total of 391 HBPs were obtained.

Benchmark Dataset Construction
In biological macromolecular classification and recognition, a reliable benchmark dataset is the foundation for constructing a reliable model [47][48][49][50]. It is well known that the Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data [51]. This database provides abundant protein information. Therefore, the raw HBP data were collected from UniProt by using heparin binding (KW-0358) as keyword. In UniProt, there are five forms of evidence for protein existence, that is, evidence at the protein level, evidence at the transcriptional level, evidence from homology, predicted, and uncertain. Obviously, the proteins with evidence provided by the first two have higher reliability, so HBPs with the other three kinds of evidence were excluded. In addition, sequences that have ambiguous residues, such as "B", "J", "O", "U", "X", and "Z" were checked and excluded. Finally, a total of 391 HBPs were obtained.
Because protein sequences with high similarity will reduce the scalability of the prediction model, those proteins with high similarity must be removed. In general, 40% or 25% is the commonly used threshold of sequence identity when constructing prediction models of special functional proteins. However, due to the limitation of the number of samples, if such sequence identity threshold is used, the number of samples will not be statistically significant, which will lead to the loss of objectivity of the model. Therefore, to balance the number of samples and sequence identity, 80% was adopted as the threshold of sequence identity. The software that performs redundant sequence removal is CD-HIT [52]. As a result, 183 HBPs were kept as positive samples.
In the prediction of special functional proteins, the selection of negative samples is a very difficult task. If all proteins in UniProt that are not annotated as HBP are selected, the data are huge, and the negative samples are almost 3000 times that of the positive samples, which is extremely unfavorable to the construction of the prediction model. In addition, the functional annotation of many of these proteins is not complete. Some proteins may be HBP, but they have not been identified before, which can also lead to bias in a machine learning model. Therefore, to avoid the above two problems as much as possible, the following steps were carried out to select negative samples. First of all, human DNA-binding proteins were chosen as candidate negative samples, because they differ greatly from positive samples in biological functions, which can avoid the problems mentioned above. To improve the reliability of negative samples, those DNAbinding proteins that have structural information, the existence of evidence at the protein level, and a sequence identity of >80% were selected. Despite such stringent criteria, 559 negative sample sequences were obtained, which is more than positive samples. To balance positive and negative samples, 183 of them were randomly selected as the final negative sample dataset.
Based on such benchmark dataset, 50 positive samples and 50 negative samples were randomly selected as test data, and the remaining as training data (133 positives and 133 negatives), formulated as

Formulation of Protein Sequences
The method for special function protein recognition based on the machine learning method is to classify the samples according to their characteristics in the benchmark dataset [53][54][55][56][57]. Protein is a sequence composed of 20 amino acids with different lengths. However, machine learning requires that every sample should have the same dimension of features. Therefore, how to transform the protein sequence into a discrete digital vector is a key problem for classification model construction. The feature vector should be able to effectively characterize the basic attributes of these samples without losing the original information [58]. In fact, in the past 30 years, scholars have developed a variety of sequence representation methods. Some of these features have good universality, such as pseudo nucleotide composition and position-specific scoring matrix, which have been successful in many protein prediction problems. Additionally, some features have strong specificity, such as amino acid composition, which is very suitable for the recognition of thermophilic proteins. In any previous work, it has not been pointed out what features are used to characterize HBP samples. Therefore, in order to describe the sequence attributes of HBP from multiple perspectives, the following four descriptors were used to extract the features of protein sequences, described as follows.

Amino Acid Composition (AAC)
Although AAC does not perform well in many prediction problems of predicted proteins, it can be used as a supplement to other features as a basic feature [59]. Therefore, this paper also uses this feature to test its prediction performance for HBP.
For any protein sequence expressed as P = R 1 R 2 , · · · , R L , where L is the length of the protein and also is the number of residues. R i (i = 1, 2, · · · , L ) is the residue at the i-th position in the sequence. R belongs to 1 of 20 amino acids (R∈(A, C, . . .,Y)). Then the AAC is the probability that 20 kinds of amino acids appear in this protein, that is, the number of 20 kinds of amino acids divided by the total number of amino acids in this protein (namely, the sequence length L). The following formula was used to express the AAC: where N R is the total number of amino acids R in the given sequence P. Then this protein can be expressed by the feature vector as where 20 denotes the dimension of the vector.

Dipeptide Composition (DC)
Studies have found that the linkage between amino acids is not random. A certain amino acid is often followed by another relatively fixed type of amino acid; that is, the arrangement of amino acids in proteins is also a unique feature of proteins. In fact, when people study the information storage of the genome, they also find that the adjacent association of nucleotides is the main way of genetic information storage. Therefore, it can also be speculated that the order of amino acids in protein sequences is one way of protein information storage. The order information of adjacent amino acids, also called DC, in proteins is still an important feature to characterize amino acids, and has been widely used in protein classification.
Since proteins have 20 kinds of normal amino acids, there are 400 kinds of dipeptides [60]. We need to count the numbers of these 400 dipeptides in the protein, and then calculate their frequencies in the whole sequence. The calculation formula is as follows: where N R f R b is the total number of the dipeptides R f R b in the given sequence P. f and b represent the front residue R and back residue R in the dipeptide R f R b . Then the feature vector for this protein can be expressed by where 400 denotes the dimension of this vector.

Dipeptide Deviation from Expected Mean (DDE)
There are 4 nucleotides and 20 amino acids in the genome. The 3 nucleotides are connected to form a codon to encode amino acids or become stop codons. Since 20 kinds of amino acids are encoded by 61 codons, there is degeneracy; that is, one kind of amino acid is encoded by multiple codons. Therefore, for any protein sequence, the theoretical frequency of dipeptide appearance can be described by the coding degeneracy of codons, which is defined as follows: where C R f and C R b are the numbers of codons that code for the first amino acid residue and the second amino acid residue in the given dipeptide "R f R b ". C N is the total number of possible codons after excluding the three stop codons (C N = 61).
Then the theoretical variance of the dipeptide "R f R b " can be defined as Diagnostics 2023, 13, 2465

of 14
The Z-transform is performed between the observed dipeptide frequency (defined in Equation (4)) and the theoretical dipeptide frequency (defined in Equation (6)) in a sequence, as shown below.
Equation (8) describes the deviation of the observed dipeptide frequency from the theoretical dipeptide frequency and is thus called DDE. Protein sample vectorization is described as where 400 denotes the dimension of this vector.

Composition/Transition/Distribution (CTD)
Usually, some fragments in a protein chain will form a specific secondary structure or have some special biological activities. Many attempts have been made to describe these fragment features effectively. Among them, CTD is one of the more effective ways to represent the amino acid distribution patterns of a specific structural or physicochemical property in a protein or peptide sequence. Thus, in this work, it is also used for feature extraction to express protein samples.
Amino acid itself is a chemical molecule with specific physicochemical properties. According to the physicochemical properties of amino acids, the frequency of amino acids in each group of properties for a sample sequence (expressed as C) can be redescribed. Amino acids with certain characteristics may form a fragment, such as a continuous hydrophilic fragment exposed on the protein surface. However, the next few amino acids may have other properties. Therefore, T measures the frequencies of property change of amino acids compared with the immediately adjacent amino acids in the sample sequence. D was proposed to character the distribution patterns of the first 25%, 50%, 75%, and 100% of the sample sequence. Details of features are in the following section.
According to the previous studies, 13 physicochemical properties of amino acids are selected for the next characterization. For each property, these 20 amino acids are divided into three categories, such as, for the secondary structure, they can be divided into helix, strand, and coil. Then, 20 amino acids can be divided into 39 total (13 × 3) groups. The percentage of each group in protein sequence is defined as follows: where n i,j is the number of residues in the i-th group of the j-th physicochemical property. Therefore, this descriptor, also known as CTDC, is used to describe the protein sequence as where 39 denotes the dimension of this vector. CTDT represents the transition probability of two adjacent amino acid residues belonging to two different groups, which can be calculated by the following formula: where n R f R b and n R b R f are the numbers of the dipeptides "R f R b " and "R b R f ", respectively, while R f and R b are amino acids in the i-th group and not. Then, vectorization is used to represent the protein as where 39 denotes the dimension of this vector. The relative location in one sequence-represented distribution of residues of given groups can be described by CTDD. Considering that a protein sequence is divided into 5 segments according to percentages of 1%, 25%, 50%, 75%, and 100%, the number of amino acids with the j-th physicochemical property in group i in each segment can be expressed as where p is 1, 25, 50, 75, and 100. When n p i,j is less than 1, it is assigned a value of 1. Then, CTDD can be represented as where loc n p i,j denotes the location at the sequence that the occurrence number of residues of a given group reaches n p i,j . Then, the feature vector of CTDD can be expressed as where 195 denotes the dimension of this vector. By combining the three features CTDC, CTDT, and CTDD (Equations (11), (13), and (16)), a protein sample could be formulated as a 39 × (2 + 5) = 273 dimensional vector shown as follows: It should be pointed out that although the three CTD features are mixed in many protein prediction works, the three features exist independently of each other, so they can also be used independently for protein prediction.

Machine Learning Methods
How to find appropriate decision conditions in the feature space, and then distinguish different types of samples, is the third step in the biological macromolecule recognition problem [61][62][63][64][65][66]. Machine learning methods can provide appropriate classification decision criteria to distinguish different types of samples [67]. They have been widely applied in bioinformatics [68][69][70][71][72][73][74][75][76]. At present, deep learning has become a popular method. However, it requires a lot of computing resources and needs more experience to search parameters. Thus, in this work, two popular algorithms, namely, support vector machine (SVM) and random forest (RF), which are very suitable for small samples, were only considered.
SVM is a typical representative of machine learning suitable for small-sample learning. Its principle involves utilizing the kernel function to transform low-dimensional samples into high-dimensional feature space, and then find the hyperplane that can distinguish samples in high-dimensional space. Since most problems in biology are nonlinear subproblems, radial basis function (RBF) are most commonly used. For a detailed introduction to SVM, please refer to the literature.
RF is an integrated learning method based on a decision tree, which can be regarded as an upgrade of a decision tree. Its principle is to construct multiple decision trees for classification during training, and its output is the category selected by most trees. RF avoids the overfitting of a decision tree when building a model.

Evaluation Indexes
After the model is built, the performance of the model needs to be evaluated. In this study, stratified 10-fold cross-validation without shuffle was performed on training data to fine-tune parameters and test models [77][78][79][80][81][82][83][84]. The grid search method was applied to search for the best parameters of the model in search spaces (Table 1). Additionally, the independent data were utilized to test the final model after the best model was established on the training data. Tenfold cross-validation and independent set test are evaluation strategies for accessing prediction ability. Models also need specific evaluation indicators to evaluate [85][86][87][88][89][90].
Here, the prediction ability of the model was evaluated by using sensitivity (Sn), specificity (Sp), overall accuracy (OA), Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (auROC): where TP and TN are the numbers of correctly predicted HBPs and non-HBPs, respectively. FP denotes the number of non-HBPs that were recognized as HBPs, while FN denotes the number of HBPs that were identified as non-HBPs. Additionally, the auROC can quantitatively evaluate the performance of the model. Thus, as there are several metrics, the auROC is the first metric to be considered. The greater the auROC, the better the performance of the model is. If there are models that have the same auROCs, OA can be the second metric to be considered. To be more precise, MCC can be the next metric that represents the performances of models.

Experiments on Training Data
Four kinds of sequence feature extraction strategies were introduced in the above section. According to their definition, each protein can be described as 20 dimension, 400 dimension, 400 dimension, and 273 dimension vectors, respectively, for AAC, DC, DDE, and CTD. Analysis of variance (ANOVA) was applied for ranking the top 5 features of four feature descriptors (Table 2). AAC, DC, and DDE have a common high F-score and p-value about amino acid S (Serine). These results also reveal the extremely different composition between HBPs and non-HBPs. Additionally, according to the high score features of CTD, the differences between HBPs and non-HBPs are mainly concentrated on solvent access, hydrophobicity, polarity, and second structure properties. Next, the prediction performance of each kind of feature on training data using SVM and RF was investigated. The ROC curves of 10-fold cross-validation are plotted in Figure 2. From Figure 2a, one may notice that DDE could produce a maximum auROC of 0.981 ± 0.028 among the four kinds of features when using SVM as classifier. However, the best feature is CTD for RF, as shown in Figure 2b. The auROC is 0.966 ± 0.033. Especially for RF, DDE is the worst feature, which can only achieve an auROC of 0.949 ± 0.049. For AAC and DC, they could produce similar results no matter what kind of classifier was adopted. Through overall comparison, the results of SVM combined with DDE are better. Therefore, this model has the greatest potential to become the ultimate HBP prediction model.  Next, the prediction performance of each kind of feature on training data using SVM and RF was investigated. The ROC curves of 10-fold cross-validation are plotted in Figure  2. From Figure 2a, one may notice that DDE could produce a maximum auROC of 0.981 ± 0.028 among the four kinds of features when using SVM as classifier. However, the best feature is CTD for RF, as shown in Figure 2b. The auROC is 0.966 ± 0.033. Especially for RF, DDE is the worst feature, which can only achieve an auROC of 0.949 ± 0.049. For AAC and DC, they could produce similar results no matter what kind of classifier was adopted. Through overall comparison, the results of SVM combined with DDE are better. Therefore, this model has the greatest potential to become the ultimate HBP prediction model.

Experiments on Independent Data
On the training data, the prediction results of two algorithms combined with four features were investigated. To further confirm whether SVM combined with DDE is the best prediction model, eight models were tested on independent data (Table 3). It shows that SVM with DDE also has the best OA of 95.0%, with a balanced Sn and Sp of 96.0% and 94.0%, respectively. Other combinations also have the best OA of 95.0%. However, the final model will be chosen according to the highest auROC. Table 3. Results of models on the independent data using different algorithms and feature descriptors.

Algorithm
Feature The eight ROC curves are plotted in Figure 3. For SVM-based models, there are two features, DDE and CTD, both of which obtain the maximum auROC. Different from the SVM-based model, the best feature of RF, still CTD, has not changed in the training data and independent data. Additionally, the best parameters for SVM with DDE are shown in Table 4.

Experiments on Independent Data
On the training data, the prediction results of two algorithms combined with four features were investigated. To further confirm whether SVM combined with DDE is the best prediction model, eight models were tested on independent data (Table 3). It shows that SVM with DDE also has the best OA of 95.0%, with a balanced Sn and Sp of 96.0% and 94.0%, respectively. Other combinations also have the best OA of 95.0%. However, the final model will be chosen according to the highest auROC. Table 3. Results of models on the independent data using different algorithms and feature descriptors.

Algorithm
Feature The eight ROC curves are plotted in Figure 3. For SVM-based models, there are two features, DDE and CTD, both of which obtain the maximum auROC. Different from the SVM-based model, the best feature of RF, still CTD, has not changed in the training data and independent data. Additionally, the best parameters for SVM with DDE are shown in Table 4.

Further Discussion
Based on the above results on training data and independent data, a very good model for HBP recognition was obtained. However, one must realize that the sequence similarity

Further Discussion
Based on the above results on training data and independent data, a very good model for HBP recognition was obtained. However, one must realize that the sequence similarity of the benchmark dataset used in this model is relatively high. Generally speaking, building models based on low-similarity datasets has better robustness and scalability. However, the number of current samples is not enough to support us in generating such data for building a model. Therefore, it is our direction to constantly collect new data to expand the samples of the model.
In this work, four features were adopted to encode samples, and numerous sequence features were developed. However, some relatively simple features were utilized to obtain satisfactory prediction accuracy, demonstrating that HBP sequences have their special characteristics. In addition, feature fusion and feature filtering are also commonly used to improve model accuracy, while not applied in this work since the current features have generated considerable prediction performance. Now, although the data size is small, it is sufficient for small sample learning algorithms such as SVM and RF to build light models with pretty good performances. This is also why this first computational work for HBP recognition comes out. Of course, another possible reason for the absence of computational methods is that researchers always focus on experimental methods and ignore them.
With the data size increasing, feature fusion and feature filtering will be implemented when a single feature cannot fully describe the sample characteristics. If there are enough data, more machine learning algorithms can be considered for comparison. Deep learning is now very popular in bioinformatics. However, this algorithm requires more computational resources and also has certain requirements for sample size and feature dimensions.
The identification HBP in medical plasma must consider that the proteins have to obtain their sequences, since our model was constructed based on sequence features. Once sequences of proteins are obtained, feature extraction and model prediction can be conducted, and the results can be produced within a few seconds. Additionally, this procedure only consumes some computing resources-even on mobile phones-without laboratory resources. Compared with molecular biology methods, for example, enzymelinked immunosorbent assay [91] is based on antibodies of known HBPs, which takes several hours to complete recognition. Computational methods save time and resources. However, there are some limitations, such as false predictions, and the model cannot predict the affinity of heparin binding, which is essential for medical use. In the future, with the development of computational methods and computing resources, more accurate recognition, more functional prediction, and faster processing speed will be achieved. Then more HBPs from various species can be identified, which aids in the research of infectious disease biomarkers.
HBP is an important biomarker of infectious diseases. The correct identification of HBP is of great significance for the study of infectious diseases. The construction of this model will provide clues for the identification of important biomarkers of infectious diseases and the discovery of potential drug targets. This work also contributes to the wider application of artificial intelligence methods in the field of clinical medicine, especially in the identification of biomarkers.

Conclusions
Four kinds of sequence features were extracted for HBP, and two machine learning methods, SVM and RF, were evaluated. Eventually, DDE combined with SVM was chosen to construct the final prediction model. The model shows good prediction results on both the training set and independent set. To our knowledge, this is the only HBP recognition model based on machine learning. This model is slightly rough, but it provides pioneering research on the use of artificial intelligence methods to study HBP. It is hoped that a more in-depth and detailed analysis of HBP can be carried out in the future. Other data are available upon request to the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.